7 research outputs found

    Efficient estimation of evolutionary distances

    The advent of high-throughput sequencers has led to a dramatic increase in the size of available genomic data. Standard methods, which have worked well for many years, are not suitable for the analysis of big data sets, due to their reliance on a time-consuming alignment step. In this thesis, a new alignment-free approach for phylogeny reconstruction is introduced. The corresponding program, andi, is orders of magnitude faster than classical approaches and also superior to comparable alignment-free methods. The central data structure in andi is the enhanced suffix array. It is used to find long exact matches between sequences. In this thesis, various approaches to the construction of enhanced suffix arrays, including novel ones, are evaluated with respect to performance. Additionally, a new parallel algorithm for the computation of suffix arrays is introduced.
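As a rough illustration of how a suffix array plus its longest-common-prefix (LCP) table can expose long exact matches between two sequences, here is a minimal sketch. It is not andi's implementation: the naive sort-based construction, the function names, and the `#` separator (assumed absent from both inputs) are all assumptions made for this example.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[i] is the shared prefix length of suffixes sa[i-1] and sa[i]."""
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for start in range(n):
        if rank[start] > 0:
            prev = sa[rank[start] - 1]
            while start + h < n and prev + h < n and text[start + h] == text[prev + h]:
                h += 1
            lcp[rank[start]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_shared_match(a, b):
    """Longest exact match between a and b (hypothetical helper).

    Assumes the separator '#' occurs in neither sequence.
    """
    text = a + "#" + b
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    best_len, best_pos = 0, 0
    for i in range(1, len(text)):
        # Only neighbouring suffixes that start in different sequences count.
        if (sa[i - 1] < len(a)) != (sa[i] < len(a)) and lcp[i] > best_len:
            best_len, best_pos = lcp[i], sa[i]
    return text[best_pos:best_pos + best_len]


print(longest_shared_match("GATTACA", "CATTAGA"))
```

The naive sort is quadratic or worse in the worst case; the construction algorithms evaluated in the thesis are what make the structure practical for whole genomes.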

    Fur: Find unique genomic regions for diagnostic PCR

    Unique marker sequences are highly sought after in molecular diagnostics. Nevertheless, there are only a few programs available to search for marker sequences, compared to the many programs for similarity search. We therefore wrote the program Fur for Finding Unique genomic Regions. Fur takes as input a sample of target sequences and a sample of closely related neighbors. It returns the regions present in all targets and absent from all neighbors. The recently published program genmap can also be used for this purpose and we compared it to fur. When analyzing a sample of 33 genomes representing the major phylogroups of E. coli, fur was 40 times faster than genmap but used three times more memory. On the other hand, genmap yielded three times more markers, but they were less accurate when tested in silico on a sample of 237 E. coli genomes. We also designed phylogroup-specific PCR primers based on the markers proposed by genmap and fur, and tested them by analyzing their virtual amplicons in GenBank. Finally, we used fur to design primers specific to a Lactobacillus species, and found excellent sensitivity and specificity in vitro. Fur sources and documentation are available from https://github.com/evolbioinf/fur. The compiled software is posted as a docker container at https://hub.docker.com/r/haubold/fox. Supplementary data are available at Bioinformatics online.
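The "present in all targets, absent from all neighbors" criterion can be pictured with a toy set computation over fixed-length words. This is only a sketch of the selection logic, not Fur's algorithm: the k-mer representation, the function names, and the choice to ignore reverse complements are assumptions made for illustration.

```python
def kmers(seq, k):
    """All distinct words of length k in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(targets, neighbors, k):
    """Words found in every target and in no neighbor (hypothetical helper)."""
    if not targets:
        return set()
    in_all_targets = set.intersection(*(kmers(t, k) for t in targets))
    in_any_neighbor = set().union(*(kmers(n, k) for n in neighbors))
    return in_all_targets - in_any_neighbor


targets = ["AAACGTTT", "CCACGTGG"]
neighbors = ["AAATTTCC"]
print(unique_kmers(targets, neighbors, 4))
```

A real marker search would also have to handle both strands and merge adjacent unique words into regions, which this sketch leaves out.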

    Fast computation of genome distances

    To understand the evolutionary relationships between organisms, they are typically presented in a tree-like structure, a phylogeny. In genomic studies, phylogenies are traditionally reconstructed from a multiple sequence alignment. While most accurate, this approach is also computationally demanding. The problem is that in order to identify shared homologies, the sequences are usually first aligned nucleotide by nucleotide. This alignment step has become a bottleneck in the practice of molecular biology, where thousands of whole bacterial genomes, each a few megabases long, are sequenced and then need to be summarized as phylogenies when analyzing pathogen outbreaks. One alternative is to use methods that estimate evolutionary distances directly from unaligned genomes. These pairwise distances can then be used to cluster sequences in a tree. Most of these alignment-free methods rely heavily on exact matching techniques for words of a fixed size for fast sequence comparison. However, they usually do not reflect the substitution rate, the most widely used measure of evolutionary distance. Instead of using words of fixed size, Haubold et al. (2015) used matches of maximal length as anchors for approximate pairwise alignments. These anchor alignments can then be used to estimate the substitution rate. A first implementation, andi, quickly estimates accurate pairwise distances from hundreds of bacterial genomes on standard hardware. However, the thousands of genomes currently being collected during outbreaks again slow the program down. Andi uses a suffix array as a full-text index for each of the input sequences. Since constructing and searching in a suffix array is slow, the aim of this thesis was to investigate whether it might be possible to compute just a single suffix array for one of the input sequences and pile all remaining sequences onto that reference. This should produce an approximate multiple sequence alignment, from which pairwise mismatches could be counted. This approach is implemented in the program phylonium (Klötzl and Haubold 2019). It is available via package managers or as open source at github.com/evolbioinf/phylonium. Phylonium is much faster than andi while losing little of its predecessor's accuracy. In this thesis I explain the background to phylonium, describe its implementation, and apply it to simulated and real data. In the application section I compare phylonium to its best competitors and show that it holds a reasonable position in the classical trade-off between speed and accuracy.
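The final counting step, once every sequence has been piled onto a single reference, might look roughly like the sketch below. The input format (one string per sequence in reference coordinates, with `-` marking uncovered reference positions) and the function name are assumptions for illustration; the indexing and pile-up that phylonium performs are not shown.

```python
from itertools import combinations


def pairwise_mismatch_rates(piled):
    """Mismatch rate for each pair over the reference columns both sequences cover.

    piled maps a sequence name to a string in reference coordinates,
    with '-' wherever that sequence does not cover the reference (assumed format).
    """
    rates = {}
    for (name_a, row_a), (name_b, row_b) in combinations(piled.items(), 2):
        shared = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
        if not shared:
            rates[(name_a, name_b)] = None  # no common columns, rate undefined
            continue
        mismatches = sum(x != y for x, y in shared)
        rates[(name_a, name_b)] = mismatches / len(shared)
    return rates


piled = {
    "ref": "ACGTACGT",
    "s1": "ACGTTCG-",
    "s2": "--GTACGA",
}
for pair, rate in pairwise_mismatch_rates(piled).items():
    print(pair, rate)
```

Only one index is built in this scheme, so the cost of comparing two non-reference sequences is a column scan rather than another index search.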

    Phylonium: fast estimation of evolutionary distances from large samples of similar genomes

    Tracking disease outbreaks by whole-genome sequencing leads to the collection of large samples of closely related sequences. Five years ago, we published a method to accurately compute all pairwise distances for such samples by indexing each sequence. Since indexing is slow, we now ask whether it is possible to achieve similar accuracy when indexing only a single sequence. We have implemented this idea in the program phylonium and show that it is as accurate as its predecessor and roughly 100 times faster when applied to all 2678 Escherichia coli genomes contained in ENSEMBL. One of the best published programs for rapidly computing pairwise distances, mash, analyzes the same dataset four times faster but, with default settings, it is less accurate than phylonium. Phylonium runs under the UNIX command line; its C++ sources and documentation are available from github.com/evolbioinf/phylonium. Supplementary data are available at Bioinformatics online.

    Support values for genome phylogenies

    We have recently developed a distance metric for efficiently estimating the number of substitutions per site between unaligned genome sequences. These substitution rates are called “anchor distances” and can be used for phylogeny reconstruction. Most phylogenies come with bootstrap support values, which are computed by resampling, with replacement, columns of homologous residues from the original alignment. Unfortunately, this method cannot be applied to anchor distances, as they are based on approximate pairwise local alignments rather than the full multiple sequence alignment necessary for the classical bootstrap. We explore two alternatives: pairwise bootstrap and quartet analysis, which we compare to classical bootstrap. With simulated sequences and 53 human primate mitochondrial genomes, pairwise bootstrap gives better results than quartet analysis. However, when applied to 29 E. coli genomes, quartet analysis comes closer to the classical bootstrap.
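For readers unfamiliar with the classical bootstrap mentioned above, the resampling step can be sketched as follows. This shows only the column resampling on a multiple sequence alignment, not the pairwise bootstrap or quartet analysis explored in the paper; the function name and the list-of-strings alignment format are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw one bootstrap replicate by sampling alignment columns with replacement.

    alignment is a list of equal-length strings, one per sequence (assumed format).
    """
    n_cols = len(alignment[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picked) for row in alignment]


alignment = ["ACGT", "ACGA", "TCGA"]
rng = random.Random(1)  # fixed seed so the run is repeatable
for row in bootstrap_alignment(alignment, rng):
    print(row)
```

A support value for a clade is then the fraction of replicates whose reconstructed tree contains that clade; tree reconstruction itself is outside this sketch.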

    andi: fast and accurate estimation of evolutionary distances between closely related genomes

    Motivation: A standard approach to classifying sets of genomes is to calculate their pairwise distances. This is difficult for large samples. We have therefore developed an algorithm for rapidly computing the evolutionary distances between closely related genomes. Results: Our distance measure is based on ungapped local alignments that we anchor through pairs of maximal unique matches of a minimum length. These exact matches can be looked up efficiently using enhanced suffix arrays, and our implementation requires only approximately 1 s and 45 MB RAM per Mbase analysed. The pairing of matches distinguishes non-homologous from homologous regions, leading to accurate distance estimation. We show this by analysing simulated data and genome samples ranging from 29 Escherichia coli/Shigella genomes to 3085 genomes of Streptococcus pneumoniae. Availability and implementation: We have implemented the computation of anchor distances in the multithreaded UNIX command-line program andi for ANchor DIstances. C sources and documentation are posted at http://github.com/evolbioinf/andi/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
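The idea of anchoring an ungapped local alignment between a pair of exact matches can be sketched as below. This is a simplified reading of the abstract, not andi's code: anchors are passed in as a precomputed list instead of being looked up in an enhanced suffix array, the equidistance test is the only pairing rule applied, and the Jukes-Cantor correction is an assumption chosen here to turn a mismatch rate into substitutions per site.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for mismatch rate p (valid for p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def anchor_mismatch_rate(query, subject, anchors):
    """Mismatch rate over regions flanked by equidistant anchor pairs.

    anchors: list of (query_pos, subject_pos, length) exact matches,
    sorted by query_pos (hypothetical input format).
    """
    aligned = 0
    mismatches = 0
    prev_paired = False
    for (q1, s1, len1), (q2, s2, len2) in zip(anchors, anchors[1:]):
        gap_q = q2 - (q1 + len1)
        gap_s = s2 - (s1 + len1)
        if gap_q == gap_s and gap_q >= 0:
            # Equidistant anchors: treat the stretch between them as homologous.
            seg_q = query[q1 + len1:q2]
            seg_s = subject[s1 + len1:s2]
            mismatches += sum(a != b for a, b in zip(seg_q, seg_s))
            if not prev_paired:
                aligned += len1  # count the left anchor once per chain
            aligned += gap_q + len2
            prev_paired = True
        else:
            prev_paired = False
    return mismatches / aligned if aligned else None


query = "ACGTACGTAAGGCCTT"
subject = "ACGTACGAAAGGCCTT"
anchors = [(0, 0, 7), (8, 8, 8)]
p = anchor_mismatch_rate(query, subject, anchors)
print(p, jukes_cantor(p))
```

Anchors that are not equidistant contribute nothing here, which mirrors the abstract's point that pairing separates homologous from non-homologous regions.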

    hotspot: software to support sperm-typing for investigating recombination hotspots

    MOTIVATION: In many organisms, including humans, recombination clusters within recombination hotspots. The standard method for de novo detection of recombinants at hotspots is sperm typing. This relies on allele-specific PCR at single nucleotide polymorphisms. Designing allele-specific primers by hand is time-consuming. We have therefore written a package to support hotspot detection and analysis. RESULTS: hotspot consists of four programs: asp looks up SNPs and designs allele-specific primers; aso constructs allele-specific oligos for mapping recombinants; xov implements a maximum-likelihood method for estimating the crossover rate; six, finally, simulates typing data. AVAILABILITY AND IMPLEMENTATION: hotspot is written in C. Sources are freely available under the GNU General Public License from http://github.com/evolbioinf/hotspot/ CONTACT: [email protected] SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.